Understanding Data: VIMO Analysis
Contents
Understanding Data: VIMO Analysis#
In this section, I will demonstrate key differences between Python and Google Sheets when it comes to cleaning a dataset and conducting a VIMO analysis. This analysis aims to identify Valid, Invalid, Missing, and Outlier variables CITE STATISTICS CANANADA VIDEO.
For the purpose of this assignment, I will not do a complete VIMO analysis. Instead, let’s find some summary statistics about our data.
Let’s start by importing pandas and our dataset using the code we saw in CROSS LIST GETTING DATA SECTION. We will also print the first five rows.
import pandas as pd
permits = pd.read_csv('https://raw.githubusercontent.com/jsmarier/course-datasets/main/ottawa-building-permits-2021.csv')
permits.head()
| ST # | ROAD | PC | WARD | PLAN | LOT | CONTRACTOR | BLG TYPE | MUNICIPALITY | DESCRIPTION | D.U. | VALUE | FT2 | PERMIT# | APPL. TYPE | ISSUED DATE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 372 | BRETTONWOOD RIDGE | K2T0H8 | Ward 4 | 4M1502 | 38 | POOLARAMA | Single | Kanata | Install an enclosure for a semi inground pool ... | 0 | 0 | 0 | 1807192 | Pool Enclosure | 2021-Oct-06 |
| 1 | 11 | BEGGS CRT | K0A2E0 | Ward 21 | 4M1579 | 1 | CONTRACTOR UNKNOWN | Single | Rideau | Intall an enclosure for a semi-inground pool (... | 0 | 0 | 0 | 2008197 | Pool Enclosure | 2021-Jul-06 |
| 2 | 5500 | CEDAR DR | K4M1B4 | Ward 20 | NaN | 18 | BOISVERT, JOHN | Single | Osgoode | Install an enclosure for a hot tub (rear yard) | 0 | 0 | 0 | 2010072 | Pool Enclosure | 2021-Jul-26 |
| 3 | 1458 | CYRVILLE RD | K1B3L9 | Ward 11 | NaN | NaN | TIKKUN CONSTRUCTION | Retail | Gloucester | Tenant fit-up to a 1 storey mixed use building... | 0 | 75,000 | 3767 | 2100001 | Construction | 2021-Jan-04 |
| 4 | 360 | LAURIER AVE W | K1P1C8 | Ward 14 | 2996 | 28 & 29 | CONTRACTOR UNKNOWN | Office | Old Ottawa | Interior alterations on the 2nd floor of an 11... | 0 | 44,486 | 753 | 2100002 | Construction | 2021-Jan-04 |
Using Python’s .info() and .describe()#
As we saw in the previous section, we can use the permits.info() to show the data type of each column, as well as the number of non-null variables:
permits.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14076 entries, 0 to 14075
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ST # 14076 non-null object
1 ROAD 14076 non-null object
2 PC 12754 non-null object
3 WARD 14070 non-null object
4 PLAN 10928 non-null object
5 LOT 7524 non-null object
6 CONTRACTOR 14067 non-null object
7 BLG TYPE 14074 non-null object
8 MUNICIPALITY 14076 non-null object
9 DESCRIPTION 14076 non-null object
10 D.U. 14076 non-null int64
11 VALUE 14076 non-null object
12 FT2 14076 non-null int64
13 PERMIT# 14076 non-null int64
14 APPL. TYPE 14076 non-null object
15 ISSUED DATE 14076 non-null object
dtypes: int64(3), object(13)
memory usage: 1.7+ MB
We can also use .describe() to find summary statistics about the numeric columns in our dataset.
permits.describe()
| D.U. | FT2 | PERMIT# | |
|---|---|---|---|
| count | 14076.000000 | 1.407600e+04 | 1.407600e+04 |
| mean | 1.948778 | 4.706217e+03 | 2.105802e+06 |
| std | 8.012956 | 4.543035e+04 | 4.400099e+03 |
| min | -18.000000 | -1.201940e+05 | 1.807192e+06 |
| 25% | 0.000000 | 5.400000e+01 | 2.102822e+06 |
| 50% | 0.000000 | 1.695000e+03 | 2.105912e+06 |
| 75% | 2.000000 | 5.157250e+03 | 2.108803e+06 |
| max | 404.000000 | 5.002896e+06 | 2.111792e+06 |
Sum & Average (Mean) Functions in Both Python & Google Sheets#
Statistics such as the sum and average (mean) can also be calculated by writing the proper Python code. Let’s use the FT2 column as an example. It shows the square footage of the different building and demolition projects listed in the dataset.
Here are the functions expressed in \(\LaTeX\) and Python. PUT REFERENCE TO MODULE 7 AND THE RESOURCES FROM ASSIGNMENT 7.
Sum#
\(\LaTeX\)
Python
permits['FT2'].sum()
Average#
\(\LaTeX\)
Python
permits['FT2'].mean()
Warning
Since I have worked with this dataset in the past, I know that there are some duplicates. These duplicates were created by city staff on purpose in order to individually list the various street addresses attached to the same permit number, for example in the case of row houses. Therefore, at this point, our sum and average functions will include these duplicates.
Let’s run the two Python functions in code cells to see the results.
permits['FT2'].sum()
66244707
permits['FT2'].mean()
4706.2167519181585
To find the sum and average of the FT2 column in Google Sheets, we can use the SUM and AVERAGE functions. They look something like this:
=SUM(M2:M14077)
=AVERAGE(M2:M14077)
We need to manually tell Google Sheets to do the required math on cells M2 through M14077, since FT2 is column N, and it features 14077 rows, including the headings.
Fig. 6 Screen capture showing the SUM and AVERAGE functions used in Google Sheets.#
“Column Stats” in Google Sheets#
Alternatively, we can select the entire column we are interested in, and then click on Data > Column stats.
Fig. 7 Screen capture of the column stats information box.#
Whether we use .describe(), Python functions, Google Sheets functions, or Google Sheets’s column stats option, the summary statistics for column M (FT2) are the same.
Here too, I would argue that Google Sheets is perhaps easier to use. However, Python presents the advantage of quickly generating a table with summary statistics by writing a simple line of code.
Showing the Distribution With Altair#
We can also use Altair to generate plots showing the frequency distribution of the values in various columns. Let’s focus on WARD. We will create a bar chart and an area chart.
The code cells are hidden, but let’s remember that we first need to import the Altair library using this code:
import altair as alt
Also, since our dataset as more than 5000 rows, we need to include the following line of code, as seen in the Data Visualization course [OTBF21]. I will also be reusing bits of code from my final project for said course.
alt.data_transformers.disable_max_rows()
import altair as alt
from myst_nb import glue
alt.data_transformers.disable_max_rows()
ward_bar = alt.Chart(permits, width=550, height=400).mark_bar().encode(
x=alt.X("WARD:N", sort="y", title="Ward"),
y=alt.Y('count()', title="Number of Projects"))
glue("ward_bar_fig", ward_bar, display=False)
C:\Users\marie\miniconda3\lib\site-packages\altair\utils\core.py:317: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
for col_name, dtype in df.dtypes.iteritems():
Fig. 8 This bar chart shows the number of construction, demolition, and pool enclosure projects in the City of Ottawa’s 23 wards.#
ward_area = alt.Chart(permits, width=550, height=400).mark_area().encode(
x=alt.X("WARD:N", sort="y", title="Ward"),
y=alt.Y('count()', title="Number of Projects"),
color="BLG TYPE:N")
glue("ward_area_fig", ward_area, display=False)
Fig. 9 This area chart is supposed to show the number of projects involving each building type. But there’s a glitch.#
The area chart seen in Fig. 9 should be showing different colours corresponding to the different building types. However, a parsing error leads to all of the building types being classified as undefined. Hence, both Google Sheets and Python have “shortcomings” when it comes to handling larger datasets.
Warning
For an unknown reason, there seems to be a parsing error somewhere, which explains why I am getting an “undefined” category for the building type instead of a list of building types. I have already spent at least one hour trying to solve this issue. Given the scope of this assignment, I decided to leave this error as is. I hope this will nonetheless count as my second chart.